AITopics | failure pattern

Collaborating Authors

failure pattern

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Why Do Multi-Agent LLM Systems Fail?

Neural Information Processing SystemsJun-13-2026, 21:47:42 GMT

Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail. Addressing this question requires systematic identification and analysis of failure patterns. We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems.

artificial intelligence, mast-data, proceedings, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)

Add feedback

How Do LLMs Fail In Agentic Scenarios? A Qualitative Analysis of Success and Failure Scenarios of Various LLMs in Agentic Simulations

Roig, JV

arXiv.org Artificial IntelligenceDec-10-2025

We investigate how large language models (LLMs) fail when operating as autonomous agents with tool-use capabilities. Using the Kamiwaza Agentic Merit Index (KAMI) v0.1 benchmark, we analyze 900 execution traces from three representative models - Granite 4 Small, Llama 4 Maverick, and DeepSeek V3.1 - across filesystem, text extraction, CSV analysis, and SQL scenarios. Rather than focusing on aggregate scores, we perform fine-grained, per-trial behavioral analysis to surface the strategies that enable successful multi-step tool execution and the recurrent failure modes that undermine reliability. Our findings show that model scale alone does not predict agentic robustness: Llama 4 Maverick (400B) performs only marginally better than Granite 4 Small (32B) in some uncertainty-driven tasks, while DeepSeek V3.1's superior reliability derives primarily from post-training reinforcement learning rather than architecture or size. Across models, we identify four recurring failure archetypes: premature action without grounding, over-helpfulness that substitutes missing entities, vulnerability to distractor-induced context pollution, and fragile execution under load. These patterns highlight the need for agentic evaluation methods that emphasize interactive grounding, recovery behavior, and environment-aware adaptation, suggesting that reliable enterprise deployment requires not just stronger models but deliberate training and design choices that reinforce verification, constraint discovery, and adherence to source-of-truth data.

granite 4, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2512.07497

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation

Kim, Subin, Mo, Sangwoo, Rizve, Mamshad Nayeem, Xu, Yiran, Liu, Difan, Shin, Jinwoo, Hinz, Tobias

arXiv.org Artificial IntelligenceDec-4-2025

Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because the prompt, crucial for guiding generation, is kept fixed. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across visuals, and redesign the prompt accordingly before regenerating the visuals with the revised prompt. To provide precise alignment feedback for prompt revision, we introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level, achieving more accurate and interpretable assessments than holistic measures. Extensive experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach, including a 15% gain on VBench 2.0. These results highlight that jointly scaling prompts and visuals is key to fully leveraging scaling laws at inference-time. Visualizations are available at the website: https://subin-kim-cv.github.io/PRIS.

arxiv preprint arxiv, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2512.03534

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Red-teaming Activation Probes using Prompted LLMs

Blandfort, Phil, Graham, Robert

arXiv.org Artificial IntelligenceNov-4-2025

Activation probes are attractive monitors for AI systems due to low cost and latency, but their real-world robustness remains underexplored. We ask: What failure modes arise under realistic, black-box adversarial pressure, and how can we surface them with minimal effort? We present a lightweight black-box red-teaming procedure that wraps an off-the-shelf LLM with iterative feedback and in-context learning (ICL), and requires no fine-tuning, gradients, or architectural access. Running a case study with probes for high-stakes interactions, we show that our approach can help discover valuable insights about a SOT A probe. Our analysis uncovers interpretable brittleness patterns (e.g., legalese-induced FPs; bland procedural tone FNs) and reduced but persistent vulnerabilities under scenario-constraint attacks. These results suggest that simple prompted red-teaming scaffolding can anticipate failure patterns before deployment and might yield promising, actionable insights to harden future probes.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2511.00554

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.48)

Industry:

Health & Medicine (1.00)
Government (0.94)
Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation

Luo, Guoqing, Maab, Iffat, Mou, Lili, Yamagishi, Junichi

arXiv.org Artificial IntelligenceOct-21-2025

While reasoning-based large language models excel at complex tasks through an internal, structured thinking process, a concerning phenomenon has emerged that such a thinking process can aggregate social stereotypes, leading to biased outcomes. However, the underlying behaviours of these language models in social bias scenarios remain underexplored. In this work, we systematically investigate mechanisms within the thinking process behind this phenomenon and uncover two failure patterns that drive social bias aggregation: 1) stereotype repetition, where the model relies on social stereotypes as its primary justification, and 2) irrelevant information injection, where it fabricates or introduces new details to support a biased narrative. Building on these insights, we introduce a lightweight prompt-based mitigation approach that queries the model to review its own initial reasoning against these specific failure patterns. Experiments on question answering (BBQ and StereoSet) and open-ended (BOLD) benchmarks show that our approach effectively reduces bias while maintaining or improving accuracy.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2510.17062

Country: North America > Canada (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

Add feedback

Butterfly Effects in Toolchains: A Comprehensive Analysis of Failed Parameter Filling in LLM Tool-Agent Systems

Xiong, Qian, Huang, Yuekai, Jiang, Ziyou, Chang, Zhiyuan, Zheng, Yujia, Li, Tianhao, Li, Mingyang

arXiv.org Artificial IntelligenceJul-22-2025

The emergence of the tool agent paradigm has broadened the capability boundaries of the Large Language Model (LLM), enabling it to complete more complex tasks. However, the effectiveness of this paradigm is limited due to the issue of parameter failure during its execution. To explore this phenomenon and propose corresponding suggestions, we first construct a parameter failure taxonomy in this paper. We derive five failure categories from the invocation chain of a mainstream tool agent. Then, we explore the correlation between three different input sources and failure categories by applying 15 input perturbation methods to the input. Experimental results show that parameter name hallucination failure primarily stems from inherent LLM limitations, while issues with input sources mainly cause other failure patterns. To improve the reliability and effectiveness of tool-agent interactions, we propose corresponding improvement suggestions, including standardizing tool return formats, improving error feedback mechanisms, and ensuring parameter consistency.

information, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2507.15296

Genre: Research Report > New Finding (0.66)

Industry: Information Technology (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

From Failures to Fixes: LLM-Driven Scenario Repair for Self-Evolving Autonomous Driving

Xia, Xinyu, Ma, Xingjun, Hu, Yunfeng, Qu, Ting, Chen, Hong, Gong, Xun

arXiv.org Artificial IntelligenceMay-29-2025

Ensuring robust and generalizable autonomous driving requires not only broad scenario coverage but also efficient repair of failure cases, particularly those related to challenging and safety-critical scenarios. However, existing scenario generation and selection methods often lack adaptivity and semantic relevance, limiting their impact on performance improvement. In this paper, we propose \textbf{SERA}, an LLM-powered framework that enables autonomous driving systems to self-evolve by repairing failure cases through targeted scenario recommendation. By analyzing performance logs, SERA identifies failure patterns and dynamically retrieves semantically aligned scenarios from a structured bank. An LLM-based reflection mechanism further refines these recommendations to maximize relevance and diversity. The selected scenarios are used for few-shot fine-tuning, enabling targeted adaptation with minimal data. Experiments on the benchmark show that SERA consistently improves key metrics across multiple autonomous driving baselines, demonstrating its effectiveness and generalizability under safety-critical conditions.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2505.22067

Genre: Research Report (0.50)

Industry:

Transportation > Ground > Road (1.00)
Information Technology > Robotics & Automation (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Interactive Visual Assessment for Text-to-Image Generation Models

Mi, Xiaoyue, Tang, Fan, Cao, Juan, Sheng, Qiang, Huang, Ziyao, Li, Peng, Liu, Yang, Lee, Tong-Yee

arXiv.org Artificial IntelligenceNov-23-2024

Visual generation models have achieved remarkable progress in computer graphics applications but still face significant challenges in real-world deployment. Current assessment approaches for visual generation tasks typically follow an isolated three-phase framework: test input collection, model output generation, and user assessment. These fashions suffer from fixed coverage, evolving difficulty, and data leakage risks, limiting their effectiveness in comprehensively evaluating increasingly complex generation models. To address these limitations, we propose DyEval, an LLM-powered dynamic interactive visual assessment framework that facilitates collaborative evaluation between humans and generative models for text-to-image systems. DyEval features an intuitive visual interface that enables users to interactively explore and analyze model behaviors, while adaptively generating hierarchical, fine-grained, and diverse textual inputs to continuously probe the capability boundaries of the models based on their feedback. Additionally, to provide interpretable analysis for users to further improve tested models, we develop a contextual reflection module that mines failure triggers of test inputs and reflects model potential failure patterns supporting in-depth analysis using the logical reasoning ability of LLM. Qualitative and quantitative experiments demonstrate that DyEval can effectively help users identify max up to 2.56 times generation failures than conventional methods, and uncover complex and rare failure patterns, such as issues with pronoun generation and specific cultural context generation. Our framework provides valuable insights for improving generative models and has broad implications for advancing the reliability and capabilities of visual generation systems across various domains.

large language model, machine learning, test input, (17 more...)

arXiv.org Artificial Intelligence

2411.15509

Country:

Asia > China > Beijing > Beijing (0.04)
South America > Brazil (0.04)
North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

WebSuite: Systematically Evaluating Why Web Agents Fail

Li, Eric, Waldo, Jim

arXiv.org Artificial IntelligenceMay-31-2024

We describe WebSuite, the first diagnostic benchmark for generalist web agents, designed to systematically evaluate why agents fail. Advances in AI have led to the rise of numerous web agents that autonomously operate a browser to complete tasks. However, most existing benchmarks focus on strictly measuring whether an agent can or cannot complete a task, without giving insight on why. In this paper, we 1) develop a taxonomy of web actions to facilitate identifying common failure patterns, and 2) create an extensible benchmark suite to assess agents' performance on our taxonomized actions. This benchmark suite consists of both individual tasks, such as clicking a button, and end-to-end tasks, such as adding an item to a cart, and is designed such that any failure of a task can be attributed directly to a failure of a specific web action. We evaluate two popular generalist web agents, one text-based and one multimodal, and identify unique weaknesses for each agent. Because WebSuite can disaggregate task failures into specific action failures, this enables granular identification of which UX flows an individual agent has trouble with and immediately highlights promising avenues for improvement. These findings highlight the need for more focused benchmarking on where web agents go wrong to effectively improve agents beyond their weaker performance today.

agent, benchmark, interaction, (17 more...)

arXiv.org Artificial Intelligence

2406.01623

Country:

North America > United States > New York (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Canada > Ontario > Toronto (0.04)
Africa > Eswatini > Manzini > Manzini (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Communications > Web (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback

Efficient Failure Pattern Identification of Predictive Algorithms

Nguyen, Bao, Nguyen, Viet Anh

arXiv.org Artificial IntelligenceJun-1-2023

Given a (machine learning) classifier and a collection of unlabeled data, how can we efficiently identify misclassification patterns presented in this dataset? To address this problem, we propose a human-machine collaborative framework that consists of a team of human annotators and a sequential recommendation algorithm. The recommendation algorithm is conceptualized as a stochastic sampler that, in each round, queries the annotators a subset of samples for their true labels and obtains the feedback information on whether the samples are misclassified. The sampling mechanism needs to balance between discovering new patterns of misclassification (exploration) and confirming the potential patterns of classification (exploitation). We construct a determinantal point process, whose intensity balances the exploration-exploitation trade-off through the weighted update of the posterior at each round to form the generator of the stochastic sampler. The numerical results empirically demonstrate the competitive performance of our framework on multiple datasets at various signal-to-noise ratios.

artificial intelligence, failure pattern, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2306.0076

Country:

North America > United States (0.46)
Asia > Vietnam (0.14)

Genre: Research Report (0.82)

Industry:

Health & Medicine (0.68)
Energy > Oil & Gas > Upstream (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.86)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback